When building classification models, feature engineering often requires discretizing continuous variables, that is, converting continuous fields into discrete ones. During discretization, the continuous variable is re-encoded. Discretized features tend to make a model more stable and reduce the risk of overfitting. This article introduces three common binning methods:

  • Equal-width binning
  • Equal-frequency binning
  • Clustering-based binning

The sections below explain each method and demonstrate the code on a small set of simulated exam scores.

import numpy as np
import pandas as pd

np.random.seed(1)

n = 20
ID = np.arange(1, n + 1)
# Simulated scores: normal distribution with mean 80, std 10, truncated to int
SCORE = np.random.normal(80, 10, n).astype('int')
df = pd.DataFrame({'ID': ID, 'SCORE': SCORE})
    ID  SCORE
0    1     96
1    2     73
2    3     74
3    4     69
4    5     88
5    6     56
6    7     97
7    8     72
8    9     83
9   10     77
10  11     94
11  12     59
12  13     76
13  14     76
14  15     91
15  16     69
16  17     78
17  18     71
18  19     80
19  20     85

We use the KBinsDiscretizer class from scikit-learn to perform all three binning operations.

  • n_bins specifies the number of bins; the default is 5
  • strategy selects the binning strategy:
    • uniform: equal-width bins; every bin spans the same width
    • quantile: equal-frequency bins, placed at quantiles of each feature so that each bin holds roughly the same number of samples
    • kmeans: bins defined by a k-means clustering run independently on each feature
  • encode controls how the binned result is encoded, e.g. as ordinal integers or as one-hot vectors

KBinsDiscretizer expects a 2-D column vector, so the DataFrame column must be reshaped first:

score = df['SCORE'].values.reshape(-1, 1)
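Before moving on, here is a small illustration of the encode parameter, using made-up values independent of the score data above:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

x = np.array([56.0, 69.0, 76.0, 83.0, 97.0]).reshape(-1, 1)

# encode="ordinal": each value is replaced by its integer bin index
ordinal = KBinsDiscretizer(n_bins=3, encode="ordinal",
                           strategy="uniform").fit_transform(x)
print(ordinal.ravel())  # [0. 0. 1. 1. 2.]

# encode="onehot-dense": one 0/1 indicator column per bin
onehot = KBinsDiscretizer(n_bins=3, encode="onehot-dense",
                          strategy="uniform").fit_transform(x)
print(onehot.shape)  # (5, 3): one column per bin
```

With encode="onehot" (the default) the result is instead a sparse matrix, which is convenient when the binned feature feeds directly into a linear model.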

Equal-width binning

Below, the scores are split into 3 bins:

from sklearn.preprocessing import KBinsDiscretizer

dis = KBinsDiscretizer(n_bins=3,
                       encode="ordinal",
                       strategy="uniform"
                      )
label_uniform = dis.fit_transform(score)  # fit and transform in one step

The minimum simulated score is 56 and the maximum is 97, so the three equal-width bins have the edges [56., 69.66666667, 83.33333333, 97.] (available after fitting as dis.bin_edges_).
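These equal-width edges can be reproduced by hand with np.linspace over the observed min and max, which makes a useful sanity check on what strategy="uniform" computes (the scores below are copied from the simulated data above):

```python
import numpy as np

score = np.array([96, 73, 74, 69, 88, 56, 97, 72, 83, 77,
                  94, 59, 76, 76, 91, 69, 78, 71, 80, 85])

# 3 equal-width bins between min and max -> 4 edge values
edges = np.linspace(score.min(), score.max(), 3 + 1)
print(edges)  # [56.  69.66666667  83.33333333  97.]

# Bin index per value: count how many interior edges each value exceeds
labels = np.digitize(score, edges[1:-1])
print(labels[:5])  # [2 1 1 0 2]
```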

Equal-frequency binning

Equal-frequency binning places (approximately) the same number of values in each bin. Again we split into 3 bins:

dis = KBinsDiscretizer(n_bins=3,
                       encode="ordinal",
                       strategy="quantile"
                      )

label_quantile = dis.fit_transform(score)
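Equal-frequency edges are simply quantiles of the data. The sketch below (using the same scores, hand-copied) shows each bin ending up with roughly 20/3 values; note that KBinsDiscretizer's exact edges can differ slightly depending on its quantile interpolation:

```python
import numpy as np

score = np.array([96, 73, 74, 69, 88, 56, 97, 72, 83, 77,
                  94, 59, 76, 76, 91, 69, 78, 71, 80, 85])

# Edges of 3 equal-frequency bins are the 0, 1/3, 2/3 and 1 quantiles
edges = np.quantile(score, [0, 1/3, 2/3, 1])
print(edges)  # [56.  73.33333333  82.  97.]

labels = np.digitize(score, edges[1:-1])
print(np.bincount(labels))  # [7 6 7] -- close to 20/3 per bin
```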

Clustering-based binning

Clustering-based binning first clusters the continuous variable and then replaces each value with the label of the cluster it falls into.

dis = KBinsDiscretizer(n_bins=3,
                       encode="ordinal",
                       strategy="kmeans"
                      )

label_kmeans = dis.fit_transform(score)  # fit_transform: the new discretizer must be fitted first
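Conceptually, the kmeans strategy runs a 1-D k-means and places bin edges halfway between adjacent cluster centers. Below is a rough sketch of that idea, not KBinsDiscretizer's exact implementation (which also seeds the centers from uniform quantiles):

```python
import numpy as np
from sklearn.cluster import KMeans

score = np.array([96, 73, 74, 69, 88, 56, 97, 72, 83, 77,
                  94, 59, 76, 76, 91, 69, 78, 71, 80, 85], dtype=float)

# Cluster the 1-D values into 3 groups
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(score.reshape(-1, 1))
centers = np.sort(km.cluster_centers_.ravel())

# Edges: data min/max plus the midpoints between adjacent centers
edges = np.concatenate(([score.min()],
                        (centers[:-1] + centers[1:]) / 2,
                        [score.max()]))
labels = np.digitize(score, edges[1:-1])
print(edges)
print(labels[:5])
```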

Comparison

df["label_uniform"] = label_uniform
df["label_quantile"] = label_quantile
df["label_kmeans"] = label_kmeans

The three labelings side by side:

|    |   ID |   SCORE |   label_uniform |   label_quantile |   label_kmeans |
|---:|-----:|--------:|----------------:|-----------------:|---------------:|
|  0 |    1 |      96 |               2 |                2 |              2 |
|  1 |    2 |      73 |               1 |                0 |              1 |
|  2 |    3 |      74 |               1 |                1 |              1 |
|  3 |    4 |      69 |               0 |                0 |              0 |
|  4 |    5 |      88 |               2 |                2 |              2 |
|  5 |    6 |      56 |               0 |                0 |              0 |
|  6 |    7 |      97 |               2 |                2 |              2 |
|  7 |    8 |      72 |               1 |                0 |              1 |
|  8 |    9 |      83 |               1 |                2 |              1 |
|  9 |   10 |      77 |               1 |                1 |              1 |
| 10 |   11 |      94 |               2 |                2 |              2 |
| 11 |   12 |      59 |               0 |                0 |              0 |
| 12 |   13 |      76 |               1 |                1 |              1 |
| 13 |   14 |      76 |               1 |                1 |              1 |
| 14 |   15 |      91 |               2 |                2 |              2 |
| 15 |   16 |      69 |               0 |                0 |              0 |
| 16 |   17 |      78 |               1 |                1 |              1 |
| 17 |   18 |      71 |               1 |                0 |              1 |
| 18 |   19 |      80 |               1 |                1 |              1 |
| 19 |   20 |      85 |               2 |                2 |              2 |
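To quantify how much two strategies disagree, pd.crosstab of two label columns is handy. The labels below are hand-copied from the df built above:

```python
import pandas as pd

labels = pd.DataFrame({
    "label_uniform":  [2, 1, 1, 0, 2, 0, 2, 1, 1, 1, 2, 0, 1, 1, 2, 0, 1, 1, 1, 2],
    "label_quantile": [2, 0, 1, 0, 2, 0, 2, 0, 2, 1, 2, 0, 1, 1, 2, 0, 1, 0, 1, 2],
})

# Rows: uniform bin, columns: quantile bin; off-diagonal cells = disagreements
print(pd.crosstab(labels["label_uniform"], labels["label_quantile"]))
```

Here 16 of the 20 samples land in the same bin under both strategies; the disagreements all involve the middle uniform bin.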

References

  • https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html
  • https://mp.weixin.qq.com/s/H19asF7Qo_0Wc5FIn8Qkww
